Transductive meta-learning with enhanced feature ensemble for few-shot semantic segmentation

This paper addresses few-shot semantic segmentation and proposes a novel transductive end-to-end method that overcomes three key problems affecting performance. First, we present a novel ensemble of visual features learned from pretrained classification and semantic segmentation networks with the same architecture. Our approach leverages the varying discriminative power of these networks, resulting in rich and diverse visual features that are more informative than those of the pretrained classification backbone used in most state-of-the-art methods, which is not optimized for dense pixel-wise classification. Second, the pretrained semantic segmentation network serves as a base class extractor, which effectively mitigates false positives that occur at inference time and are caused by base objects other than the object of interest. Third, a two-step segmentation approach using transductive meta-learning is presented to address episodes with poor similarity between the support and query images. The proposed transductive meta-learning method first learns the relationship between labeled and unlabeled data points by matching support foreground features to query features (intra-class similarity) and then applies this knowledge to predict on the unlabeled query image (intra-object similarity), simultaneously learning propagation and false positive suppression. To evaluate our method, we performed experiments on benchmark datasets, and the results demonstrate significant improvement with only 2.98M trainable parameters.
Specifically, using ResNet-101, we achieve state-of-the-art performance for both 1-shot and 5-shot Pascal-$5^{i}$, as well as for 1-shot and 5-shot COCO-$20^{i}$.


Figures S1 and S2 show the discriminative power $\rho_k$ of each backbone at layers $k$, $1 \le k \le |B_{cls}|$, of the frozen pretrained backbones $B_{cls}$ and $B_{sem}$, calculated using more than 4,000 episodes from Pascal-$5^{i}$. Figure S1 shows the results with ResNet-50, and Figure S2 shows the results with ResNet-101. The discriminative power $\rho_k$ at layer $k$ is measured as the ratio
$$\rho_k = \frac{\frac{1}{N}\sum_{i=1}^{N} d_{\cos}\left(FG_Q^{i}, P_S\right)}{\frac{1}{M}\sum_{j=1}^{M} d_{\cos}\left(BG_Q^{j}, P_S\right)},$$
where $d_{\cos}$ denotes the cosine distance and $P_S$ is the support prototype calculated by averaging all the foreground support features $FG_S$. The numerator is the average cosine distance of the $N$ foreground query features $FG_Q^{i}$, $1 \le i \le N$, to the foreground support prototype $P_S$, and the denominator is the average cosine distance of the $M$ background query features $BG_Q^{j}$, $1 \le j \le M$, to $P_S$.
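The ratio described above can be illustrated with a minimal NumPy sketch. It assumes that cosine distance means one minus cosine similarity and that the features of a single layer are already flattened into rows; the function names `cosine_distance` and `discriminative_power` and the toy vectors are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_distance(features, prototype):
    """Cosine distance (assumed here to be 1 - cosine similarity) of each row to the prototype."""
    features = np.asarray(features, dtype=float)
    prototype = np.asarray(prototype, dtype=float)
    dots = features @ prototype
    norms = np.linalg.norm(features, axis=1) * np.linalg.norm(prototype)
    return 1.0 - dots / np.maximum(norms, 1e-12)


def discriminative_power(fg_support, fg_query, bg_query):
    """Mean foreground-query distance divided by mean background-query distance."""
    # Support prototype P_S: average of all foreground support features.
    prototype = np.asarray(fg_support, dtype=float).mean(axis=0)
    fg_dist = cosine_distance(fg_query, prototype).mean()
    bg_dist = cosine_distance(bg_query, prototype).mean()
    return fg_dist / bg_dist


if __name__ == "__main__":
    # Toy 2-D features for one layer (illustrative values only).
    fg_support = [[1.0, 0.0], [1.0, 0.0]]
    fg_query = [[1.0, 0.0], [1.0, 1.0]]
    bg_query = [[0.0, 1.0], [-1.0, 0.0]]
    print(discriminative_power(fg_support, fg_query, bg_query))
```

Under this distance convention, a value below 1 would mean that the foreground query features lie closer to the support prototype than the background query features do.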

Transductive meta-learning
The object in the support image is frequently not visually similar to that in the query image, leading to under- and oversegmentation. Table S1, third column (1st pass), depicts the most frequent cases occurring when matching support foreground features to query features, namely undersegmentation (top) and oversegmentation (middle, bottom). Several strategies have been proposed to use this initial query prediction as an additional source of information to improve results in a second step [1][2][3][4].
According to the Gestalt principle, the second step can be utilised to refine an undersegmented initial prediction. However, an issue arises when the initial query prediction yields a large number of false positives. To mitigate these cases, the SSP method proposed by Fan et al. [3] uses a two-stage, prototype-based approach that refines the initial query segmentation through the selective propagation of query features in the second step. The selective propagation depends on a user-defined, non-adaptive threshold, which eliminates gradients and prevents backpropagation through the network. In situations where the probability of false positives is greater than the threshold, this not only fails to suppress them but also makes the problem worse by propagating them.
Instead of introducing non-differentiable operations like hard-thresholding, as in Fan et al. [3], we address this issue by allowing the network to learn the visual dissimilarities between the query foreground features and the false positives in an end-to-end manner. In the first pass, support foreground features are matched to query features, and in the second pass, false positives are suppressed and query foreground features are propagated throughout the query image. The proposed second pass does not introduce new parameters to the network. We use multi-level all-pairs field transforms [5] that result in a multiscale hypercorrelation volume [6] to leverage the different levels of visual features learned at each layer of the backbone. Table S1, fourth column (2nd pass), shows some instances of our proposed transductive learning method in which the network simultaneously learns to suppress and propagate from the initial segmentation. The advantages of our method of self-refinement are summarized below.
1. We adopted 4D-Conv for our self-refinement module, which outperforms the prototyping approach of SSP.
2. Our self-refinement module does not add any parameters to the network, whereas SSP fine-tunes the last two blocks of a ResNet backbone with 1M parameters.
3. Our self-refinement module can operate on top of any backbone, which is another significant advantage over SSP, which reshapes the embedding space for self-refinement.
4. SSP employs a non-differentiable method that uses a user-specified hard threshold. This restricts the ability to add trainable modules after the non-differentiable operation. We do not use non-differentiable operations. Instead, we enable the network to learn end-to-end the visual differences between the query foreground features and false positives.
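The two-pass idea can be illustrated with a rough NumPy sketch. This is not the authors' 4D-convolution module. It assumes flattened per-location features, plain cosine correlation, and a soft sigmoid weighting that stands in for the learned, differentiable refinement. The function names, the `temperature` and `bias` values, and the toy features are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    x = np.asarray(x, dtype=float)
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def first_pass(support_feats, support_mask, query_feats):
    """Intra-class similarity: best cosine match of each query location to the support foreground."""
    support_fg = l2_normalize(np.asarray(support_feats, dtype=float)[np.asarray(support_mask) > 0])
    query = l2_normalize(query_feats)
    corr = query @ support_fg.T  # shape: (query locations, support foreground locations)
    return corr.max(axis=1)


def second_pass(query_feats, initial_score, temperature=0.1, bias=0.5):
    """Intra-object similarity: propagate within the query, weighted by the first-pass scores."""
    query = l2_normalize(query_feats)
    # Soft weighting (an assumption of this sketch) rather than a hard threshold,
    # so the operation remains differentiable with respect to the initial scores.
    weights = sigmoid((np.asarray(initial_score, dtype=float) - bias) / temperature)
    corr = query @ query.T  # query-to-query correlation
    return (corr * weights[None, :]).sum(axis=1) / np.maximum(weights.sum(), 1e-12)


if __name__ == "__main__":
    support_feats = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
    support_mask = np.array([1, 0])  # only the first support location is foreground
    query_feats = np.array([
        [1.0, 0.0, 0.0],   # object part that resembles the support
        [0.6, 0.8, 0.0],   # object part that partially resembles the support
        [0.2, 0.98, 0.0],  # object part that resembles other query parts, not the support
        [0.0, 0.0, 1.0],   # background
    ])
    initial = first_pass(support_feats, support_mask, query_feats)
    refined = second_pass(query_feats, initial)
    print(initial)
    print(refined)
```

In this toy setting, the third query location matches the support poorly in the first pass but is similar to confidently matched query locations, so its score rises in the second pass while the background location stays low.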
Table S1. Results from our two-pass method. 1st pass: intra-class similarity (S → Q). 2nd pass: intra-object similarity. Columns: Support, Query, 1st pass, 2nd pass.

Mitigating Propagation of False Positives
The propagation of false positives can be a significant problem in semantic segmentation, particularly when dealing with complex backgrounds or multiple classes that share similar visual features. As noted by Lang et al. [7], the presence of base classes in the background of the query image can lead to false positive predictions, as the network may incorrectly classify pixels that are not part of the object of interest. To address this issue, they proposed auxiliary layers on top of a base learner that is trained on base classes to predict whether or not each pixel in the output of the meta learner corresponds to a base class. By using this information to selectively mask out base class predictions, they were able to reduce the number of false positives